npj Digital Medicine
Top medRxiv preprints most likely to be published in this journal, ranked by match strength.
AI has shown promise in predicting surgical complications, but most existing models estimate overall risk levels rather than identifying the specific complications an individual patient may develop. We present an AI agent that uses a Virtual Patients Ensemble (VPE) approach to generate individualized predictions of surgical complications from unstructured case descriptions. The agent applies structured reasoning to extract diagnoses, surgical procedures, and risk factors from clinical narratives...
Importance: Artificial Intelligence-driven analysis of laparoscopic video holds potential to increase the safety and precision of minimally invasive surgery. Vision-language models are particularly promising for video-based surgical decision support due to their ability to comprehend complex temporospatial (video) data. However, the same multimodal interfaces that enable such capabilities also introduce new vulnerabilities to manipulations through embedded deceptive text or images (prompt inj...
We present the Surgical Information Assistant, an agentic retrieval-augmented generation (RAG) system designed to improve access to surgical knowledge in resource-constrained settings. Built on the Open Manual of Surgery for Resource-Limited Settings, the assistant uses a retrieval method we call DeRetSyn (Decompose-Retrieve-Synthesize). We evaluate DeRetSyn using automated metrics and partial human validation across 14,500 synthesized question-answer pairs and find that it achieves 63% top-1 a...
Objectives: Artificial intelligence (AI) in chronic disease prediction often exhibits algorithmic biases, hindering equitable healthcare delivery. This study aims to develop and evaluate a Smart User Interface (Smart UI) framework that enhances fairness in diabetes prediction systems by operationalizing fairness at the human-computer interaction level, a dimension frequently overlooked in AI fairness research. Materials and Methods: We employed a nine-metric fairness evaluation framework across fou...
Hallucinations in foundation models arise from autoregressive training objectives that prioritize token-likelihood optimization over epistemic accuracy, fostering overconfidence and poorly calibrated uncertainty. In clinical settings, where profound knowledge asymmetry exists between AI systems and end-users, undetected misinformation such as fabricated medications, contraindicated drug recommendations, or false imaging interpretations poses direct patient safety risks. We define medical hallu...
Generative artificial intelligence (AI) is rapidly populating medical records with synthetic or partially AI-generated content, creating a feedback loop where future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic rel...
Background: Progress in artificial intelligence-based analysis of surgical videos has been constrained by reliance on manual frame-level annotations rather than patient-level outcomes. In addition, concerns about data privacy restrict the exchange of laparoscopic video data and, thereby, multicenter collaboration. Methods: To address these limitations, we developed a pipeline that integrates weakly supervised deep learning with Swarm Learning, a decentralized machine learning approach that enables ...
Objective: Assessing medical student performance in Objective Structured Clinical Examinations (OSCEs) is labor-intensive, requiring trained evaluators to review 15-minute-long videos. The physical examination period constitutes only a small portion of these videos. Automated segmentation of OSCE videos could significantly streamline the evaluation process by detecting this physical exam portion for targeted evaluation. Current video analysis approaches struggle with these long recordings due to ...
Background: Clinical decision support requires language models that provide guideline-aligned, context-aware reasoning with clear justification. Many existing benchmarks emphasize multiple-choice or short-form question answering and mainly capture factual recall rather than longitudinal clinical reasoning from extended clinical notes. Hippocrates-o1 is a family of domain-tailored clinical reasoning pipelines that combine structured prompts, guideline-informed retrieval, and iterative self-refineme...
Background: Diagnostic errors are a leading cause of preventable patient harm, often occurring during early clinical encounters where diagnostic uncertainty is maximal. Large language models (LLMs) have shown potential in medical reasoning, yet their ability to function as a diagnostic safety net, specifically by identifying and correcting human diagnostic errors, remains systematically unquantified. We evaluated whether state-of-the-art LLMs can effectively challenge, rather than merely confirm, ...
Rare or unexpected postoperative neurosurgical complications pose a challenge due to clinical variability and gaps in available data. We introduce the Neurosurgical Uncertainty Index (NUI), an uncertainty-aware AI framework that integrates bootstrap sampling for aleatoric uncertainty, isolation forest anomaly detection, and clinical calibration to predict and stratify risks for 13 complications. NUI distinguishes between data-driven and model-driven uncertainty and highlights cases that conventi...
Despite the proliferation and clinical deployment of artificial intelligence (AI)-based medical software devices, most remain black boxes that are uninterpretable to key stakeholders including patients, physicians, and even the developers of the devices. Here, we present a general model auditing framework that combines insights from medical experts with a highly expressive form of explainable AI that leverages generative models, to understand the reasoning processes of AI devices. We then apply ...
As large language models (LLMs) are increasingly adopted in medical decision-making, concerns about demographic biases in AI-generated recommendations remain unaddressed. In this study, we systematically investigate how demographic attributes, specifically race and gender, affect the diagnostic, medication, and treatment decisions of LLMs. Using the MedQA dataset, we construct a controlled evaluation framework comprising 20,000 test cases with systematically varied doctor-patient demographic pair...
Pathology faces persistent challenges including a global shortage of specialists, uneven access to expertise, increasing diagnostic complexity, and a growing need for second-opinion consultations. While digital and telepathology platforms address parts of this problem, existing solutions often trade accessibility for structured, workflow-aware clinical integration. At the same time, multimodal medical AI shows promise for diagnostic support but raises concerns regarding transparency, automation ...
Background: Cardiac surgery is one of the most complex and high-stakes areas of medicine, where intraoperative decisions must be made within seconds and incomplete information can compromise outcomes. Traditional risk scores and rule-based decision support tools provide limited real-time guidance and rarely integrate the unstructured data streams available during surgery. Recent advances in large language models (LLMs) such as OpenAI's GPT-5 and Anthropic's Claude 3.5 family have demonstrated state-...
Global surgical care faces a severe workforce shortage, with more than 1.2 million additional specialists needed by 2030, particularly in low- and middle-income countries (LMICs). Large language models (LLMs) have demonstrated impressive medical reasoning on standardized exams, but their safety, reliability, and specialty-specific performance--especially in procedural fields such as surgery--remain uncertain. Here we evaluate over 40 state-of-the-art LLMs on 3,900 expert-authored multiple-choice...
Objective: To evaluate the efficacy of digital twins developed using a large language model (LLaMA-3), fine-tuned with Low-Rank Adapters (LoRA) on ICU physician notes, and to determine whether specialty-specific training enhances treatment recommendation accuracy compared to other ICU specialties or zero-shot baselines. Materials and Methods: Digital twins were created using LLaMA-3 fine-tuned on discharge summaries from the MIMIC-III dataset, where medications were masked to construct training and...
Chronic wounds affect over 1.2 million Canadians and incur healthcare costs exceeding $13 billion annually, with global expenditures approaching $149 billion. Current clinical practice relies on manual measurements and subjective visual evaluations, which overestimate wound area by up to 40% and demonstrate poor-to-moderate inter-rater reliability. This variability complicates longitudinal monitoring and evidence-based treatment selection. We developed and evaluated an integrated mobile platform...
Traditional surgical training relies heavily on hands-on experience gained through relatively infrequent procedures during apprenticeships. Recently, postoperative review has become a valuable supplement to this model, offering learning opportunities outside the operating room. However, its adoption remains limited due to its inefficiencies. In this study, we developed a computer vision-based system designed to efficiently navigate and retrieve critical segments from laparoscopic cholecystectom...
Medicine historically separates abstract clinical reasoning from physical intervention. We bridge this divide with MedOS, a general-purpose embodied world model. Mimicking human cognition via a dual-system architecture, MedOS demonstrates superior reasoning on biomedical benchmarks and autonomously executes complex clinical research. To extend this intelligence physically, the system simulates medical procedures as a physics-aware model to foresee adverse events. Generating and validating on the...